Final Project

Author

Alekhya Chaparala, Jacob Ausubel, Juliet Hayes, Saloni Bhatia

Published

December 14, 2023

INTRODUCTION

Though sustained growth in India since the 1980s has pulled millions of its citizens out of poverty, inequality in the country has steadily grown in recent years. A significant portion of this disparity exists between rural and urban regions, as urban growth has accounted for 80% of the total fall in poverty in the country (UN India 2023). Rather than solely relying on national-level development and growth policies, in order for the country to fulfill its commitment to eradicating poverty it is necessary to develop robust methods of targeting high-poverty areas, especially in rural areas. In order to do this, access to highly geographically-specific information about communities experiencing poverty is essential.

In this project, we rely on data that allow us to look at poverty indicators at the smallest unit of analysis available in India (‘shrid’-level) to predict areas of high poverty in rural areas. Specifically, we focus on the efficacy of night-time light data as a predictor, which is a proxy for poverty that economists have increasingly relied on in recent years. Using exploratory data analysis, geospatial analysis, and machine learning techniques, we hope to illuminate where resources would be best targeted in the country for decision makers. Further, we hope to reveal some of the implications that this might have, especially in the realm of remediating social inequities, such as for members of scheduled castes.

LITERATURE REVIEW

Several scholars have already used night time lights to identify and predict levels of poverty. However, their academic papers often have distinct methodologies compared to one another.

First, geographic units of observation differ between studies. Elvidge et al. (2008) use satellite data to predict levels of poverty in over a hundred countries. Their results are then aggregated into a global estimate for the total number of people living in poverty (Elvidge et al. 2008). Noor et al. (2008), meanwhile, use night time light data to predict levels of poverty in Administrative 1 units (n = 338) in 37 African countries. Wang et al. (2012) predict poverty levels in 31 provinces in China. Lastly, Subash et al. (2018) predict rural poverty rates in states in India.

In much of the literature, the geographic unit of observation is fairly broad. By contrast, Xu et al. (2022) have a much narrower focus: using remote sensing data to identify poverty counties and to predict poverty incidence in the Guizhou province of China.

Second, measures of poverty differ from study to study. Noor et al. (2008), for example, focus on household asset variables. Rather than including lots of variables, the authors use a technique called principal component analysis (PCA) to create an aggregated measure of asset-based poverty (Noor et al. 2008). PCA is a technique for transforming a large set of variables into a smaller one that nevertheless contains most of the information in the larger set. Similarly, Wang et al. (2012) use PCA to combine 17 socioeconomic indexes into an integrated poverty index (IPI). By contrast, Subash et al. (2018) use a much simpler measure, per-capita GDP at the state level, as their dependent variable.

Third, the authors of the studies use very different modeling strategies. Subash et al. (2018), for example, use a machine learning method called an artificial neural network (ANN) to predict rural poverty at the state level. Xu et al. (2022), meanwhile, use support vector machines to classify counties based on levels of poverty, and use regression trees to predict the incidence of poverty in counties.

DATA SOURCES

We relied on data from the open data platform administered by the Development Data Lab. The first dataset we employed in our analysis is the Socio Economic Caste Census of 2012, which includes various household parameters including income, occupation, land ownership, educational attainment, and caste. We also included the 2011 Population Census Abstract, which provides information about the number of households, gender breakdown, and classification of workers.

For night-time light data, we used the annual global VIIRS nighttime lights, which capture the average light produced each month using satellite data. Finally, for our poverty data, we used SECC Rural consumption data, which include rural consumption and poverty rate estimates from 2011. As discussed above, the Development Data Lab provides these data at the shrid level, meaning that we can analyze these variables at a smaller unit of analysis- the village/town level - which is at a higher-resolution than the Indian government official data sets allow for.

[Load Packages]

#Loading packages
library(tidyverse)
library(readr)
library(googledrive)
library(tidymodels)
library(caret)
library(rpart)
library(dplyr)
library(recipes)
library(patchwork)
library(yardstick)
library(parsnip)
library(vip)
library(GGally)
library(sf)
library(tidyverse)
library(janitor)
library(lubridate)
library(tidycensus)
library(httr)
library(jsonlite)
library(dotenv)
library(here)
library(testthat)
library(purrr)
library(roxygen2)
library(codetools)
library(tidymodels)
library(ggplot2)
library(vip)
library(GGally)
library(caret)
library(yardstick)
library(broom)
library(magrittr)
library(dplyr)
library(randomForest)
library(parsnip)
library(rpart.plot)
library(patchwork)
library(purrr)
library(rgdal)
#setting up drive 
drive_auth(
  email = gargle::gargle_oauth_email(),
  path = NULL,
  subject = NULL,
  scopes = "drive",
  cache = gargle::gargle_oauth_cache(),
  use_oob = gargle::gargle_oob_default(),
  token = NULL
)

METHODOLOGY

For this study, we have used the Socioeconomic High-resolution Rural-Urban Geographic Platform for India (SHRUG) data, an open data platform looking at socioeconomic development across 600,000 villages and towns in India (Asher et al. 2022). The units of analysis are “shrids” in India, which are typically at the village/town level (specific to SHRUG data). Shrids provide a consistent geographic identifier across space from 1990 to present (Asher et al. 2021).

We chose to work with four datasets for the analysis: (1) SECC consumption, (2) Night Lights, (3) Social and Economic Caste Census and (4) Census Abstract data. The dependent/prediction variable is secc_pov_rate_rural (taken from the SECC consumption data set). It gives the estimated share of rural households consuming less than 31 INR ($0.37) per day in 2012. The key independent variable is viirs_annual_mean (taken from the night light data set), which is the average night light across rural India in 2012. The predictors (taken from the social and economic caste census and census abstract data sets) include a number of variables on education, employment, housing, wealth, and caste in India. To ensure that the supervised machine learning models work well, we decided to use all variables across the two datasets.

There are two final data frames (one for each state; Assam and Madhya Pradesh) is a merged dataset is a merged version of the above-mentioned data sets. More information on the data has been provided in the Data Sources section.

Supervised Machine Learning Models: 1) Decision Tree Models for Classification 2) Linear Regression: The easily interpretable linear regression model has been chosen to check if there is a linear relationship between the poverty and night lights. 3) Least Absolute Shrinkage and Selection Operator (LASSO) model: The large number of predictor variables in our final data frame have been eliminated using the LASSO model as it helps in variable selection.

Note that observations with missing values were omitted from both the Assam and Madhya Pradesh data frames. The reason is to allow the machine learning models to run more smoothly. We are skeptical that this would have much of an effect on results, as most variables had at most 5-6% missing data.

POLICY RELEVANCE

Our analysis is largely oriented towards two states in India - Assam and Madhya Pradesh. We chose Assam because one-third of the state’s population lives in poverty, and its geographical position in the country means that it is largely removed from the major production centers (Testbook 2023). Further, the state is one of the most vulnerable in the country to climate change, and therefore assessing which communities in the state are most in need of investment will become increasingly important, as poverty is directly related to climate resilience (Mohady and Wadhawan, 2021). We also chose Madhya Pradesh as it is one of the states with the highest levels of inequality, as measured by the Gini coefficient (Chandrasekhar et al. 2021).

In both of these contexts, we felt that a machine learning approach would be a helpful tool for decision makers. The granularity of the shrid dataset will hopefully allow for effective targeting, as it illuminates the specific communities that are most in need of resources and investment. Moreover, in assessing the utility of the night time lights dataset in these specific states and using such a small unit of analysis, we hope to be able to contribute an important data point in this literature by either affirming its generalizability or highlighting some of the possible limitations of these data that economists are increasingly relying on as a proxy for economic activity.

[Load Datasets]

#nightlight (viirs) data
viirs_drive <- drive_download(file = "viirs_annual_shrid.csv", overwrite = TRUE)
viirs_data <- read_csv("viirs_annual_shrid.csv")
file.remove("viirs_annual_shrid.csv")
[1] TRUE
#filter viirs data to 2012, keep average values (not median values)
viirs_data <- viirs_data %>%
  filter(year == 2012) %>%
  filter(category == "average-masked")

#poverty + consumption data 
secc_cons_drive <- drive_download(file = "secc_cons_rural_shrid1.csv", overwrite = TRUE)
secc_cons <- read_csv("secc_cons_rural_shrid1.csv")
file.remove("secc_cons_rural_shrid1.csv")
[1] TRUE
#social and economic caste census data
secc_caste_drive <- drive_download(file = "secc_rural_shrid.csv", overwrite = TRUE)
secc_caste <- read_csv("secc_rural_shrid.csv")
file.remove("secc_rural_shrid.csv")
[1] TRUE
#census abstract data (2011)
census_abstract_drive <- drive_download(file = "census_abstract.csv", overwrite = TRUE)
census_abstract <- read_csv("census_abstract.csv")
file.remove("census_abstract.csv")
[1] TRUE
#remove intermediate dfs
rm(viirs_drive)
rm(secc_cons_drive)
rm(secc_caste_drive)
rm(census_abstract_drive)

[Parse Dataframes]

#viirs
viirs_data <- viirs_data %>%
  mutate(shrid2_clone = shrid2) %>%
  separate(shrid2_clone, into = c("YY", "SS", "DDD", "sssss", "TTTTTT"), sep = "-", extra = "merge") 

#secc_cons
secc_cons <- secc_cons %>%
  mutate(shrid2_clone = shrid2) %>%
  separate(shrid2_clone, into = c("YY", "SS", "DDD", "sssss", "TTTTTT"), sep = "-", extra = "merge") 

#secc_caste
secc_caste <- secc_caste %>%
  mutate(shrid2_clone = shrid2) %>%
  separate(shrid2_clone, into = c("YY", "SS", "DDD", "sssss", "TTTTTT"), sep = "-", extra = "merge") 

#census abstract
census_abstract <- census_abstract %>%
  mutate(shrid2_clone = shrid2) %>%
  separate(shrid2_clone, into = c("YY", "SS", "DDD", "sssss", "TTTTTT"), sep = "-", extra = "merge") 

[Create merged dataframes for each state]

#create state-specific versions of all four dataframes:

#madhya pradesh (State Code = 23)
viirs_data_mp <- viirs_data %>%
  filter(SS == 23)
secc_cons_mp <- secc_cons %>%
  filter(SS == 23)
secc_caste_mp <- secc_caste %>%
  filter(SS == 23)
census_abstract_mp <- census_abstract %>%
  filter(SS == 23)

#merge madhya pradesh data-frames
mp_list <- list(viirs_data_mp, secc_cons_mp, secc_caste_mp, census_abstract_mp)

madhya_pradesh <- Reduce(function(x, y) full_join(x, y, by = 'TTTTTT'), mp_list)

#assam (State Code = 18)
viirs_data_as <- viirs_data %>%
  filter(SS == 18)
secc_cons_as <- secc_cons %>%
  filter(SS == 18)
secc_caste_as <- secc_caste %>%
  filter(SS == 18)
census_abstract_as <- census_abstract %>%
  filter(SS == 18)

#merge assam data-frames
as_list <- list(viirs_data_as, secc_cons_as, secc_caste_as, census_abstract_as)
assam <- Reduce(function(x, y) full_join(x, y, by = 'TTTTTT'), as_list)
 #merge all data frames together

#remove duplicates
assam<- select(assam, -matches("_data2"))

[Madhya Pradesh: Data Cleaning / ML Set-Up]

#checking for missing values merged MP dataset 
#is_na_madhya_pradesh <- is.na(madhya_pradesh)
#print(is_na_madhya_pradesh)

#There are missing values 
#any_missing_madhya_pradesh <- any(is.na(madhya_pradesh))
#print(any_missing_madhya_pradesh)

#filter to 2012
madhya_pradesh <- madhya_pradesh %>%
  filter(year == 2012)

#create testing and training datasets 
set.seed(12152023)
mp_split <- initial_split(data = madhya_pradesh, prop = 0.7)
mp_train <- training(x = mp_split)
mp_test <- testing(x = mp_split)

#dropping missing values (MP)
mp_train <-
  mp_train %>%
  na.omit()

#dropping missing values (MP)
mp_test <-
  mp_test %>%
  na.omit()

[Assam: Data Cleaning / ML Set-Up]

#checking for missing values merged Assam dataset 
#is_na_assam <- is.na(assam)
#print(is_na_assam)

#There are missing values 
#any_missing_assam <- any(is.na(assam))
#print(any_missing_assam)

#filter to 2012
assam <- assam %>%
 filter(year == 2012)

#create testing and training datasets 
set.seed(12152023)
as_split <- initial_split(data = assam, prop = 0.7)
as_train <- training(x = as_split)
as_test <- testing(x = as_split)

#dropping missing values (assam)
as_train <-
  as_train %>%
  na.omit()

#dropping missing values (assam)
as_test <-
  as_test %>%
  na.omit()

DATA WRANGLING & VISUALIZATIONS

We performed a series of data visualizations to understand the relationship between poverty and potential key predictors. The predictor variables we selected for this initial exploratory data analysis include: night light distribution, education, scheduled caste/scheduled tribe population , literacy rate, and agricultural employment. The key dependent variable in this visualizations is the shrid-level rural poverty rate, i.e. the percentage of households in the shrid which fall below the poverty line. We created separate graphs for both study states, and present them side-by-side in panels below.

The first panel visualizes the distribution of shrid-level rural poverty in each study state using histograms. The distribution of below-poverty line households follows a roughly normal distribution in Madhya Pradesh, while being slightly right-skewed in Assam. In both states, we see that the median proportion of households in poverty at the shrid level is between 40-50%, confirming a significant degree of economic inequality in both states. The second panel plots average annual shrid-level night light radiance against shrid-level rural poverty; it is difficult to discern a meaningful relationship here, the night light values are clustered below 5 units. However, there are outliers in each state; the maximum night light value in Madhya Pradesh is over 35 units, while the maximum night light value in Assam is slightly over 250 units. Outlier night light values are found at varying levels of poverty.

The third panel illustrates an inverse relationship between middle school graduation rates and shrid poverty rate in both states, which is consistent with established literature on the relationship between education and poverty (Spada et. al 2023). Similarly, the fifth panel illustrates an inverse relationship between poverty and literacy rate in both Madhya Pradesh and Assam.

The fourth panel illustrates the relationship between the proportion of a shrid’s population which belong to a scheduled caste or scheduled tribe community and poverty. In the Madhya Pradesh graph, we see that a higher proportion of scheduled caste/tribe residents corresponds to a higher shrid poverty rate, which is consistent with existing literature which has demonstrated higher poverty rates among scheduled caste/tribe communities (Meenakshi et. al, 2000). Interestingly, the Assam graph does not demonstrate a linear relationship between caste and poverty. Shrid poverty rates in Assam remain fairly constant at around 35% for shrids with anywhere from 25-75% of the population belonging to a scheduled caste/tribe community. Poverty rates are highest at low and high percentages of SC/ST residents, perhaps suggesting that caste-segregated shrids in Assam face more poverty than shrids with more caste integration.

The final panel illustrates the relationship between shrid poverty rate and agricultural employment. In Madhya Pradesh, shrid poverty rate is inversely related to the percentage of workers who are employed in agriculture until approximately 50%. Past 50% agricultural employment, shrid poverty rate appears to rise. In Assam, the relationship between shrid poverty rate and agricultural is non-linear; while shrids with the highest percentage of agricultural workers have the highest poverty rate, agricultural employment may not be the best predictor of poverty in this state.

[Data Visualization w/ Training Dataset]

#A: Histograms of Poverty Rate by State

#Madhya Pradesh
plot1 <- ggplot(mp_train, aes(x=secc_pov_rate_rural)) + 
  geom_histogram() +
  geom_vline(xintercept = mean(mp_train$secc_pov_rate_rural, na.rm = TRUE), col = "red") +
  xlab("% of Shrid Households in Poverty") +
  ylab("Count") +
  labs(title = "Madhya Pradesh") +
  theme_minimal() + 
  theme(
    axis.title = element_text(size = 8), 
    axis.text = element_text(size = 8),
    plot.title = element_text(size = 10)
  )

#Assam
plot2 <- ggplot(as_train, aes(x=secc_pov_rate_rural)) + 
  geom_histogram() +
  geom_vline(xintercept = mean(as_train$secc_pov_rate_rural, na.rm = TRUE), col = "red") +
  xlab("% of Shrid Households in Poverty") +
  ylab("Count") +
  labs(title = "Assam") +
  theme_minimal() + 
  theme(
    axis.title = element_text(size = 8), 
    axis.text = element_text(size = 8),
    plot.title = element_text(size = 10)
  )

#side-by-side
plot_combined <- plot1 + plot2 +
  plot_annotation(title = "Rural Poverty Distribution") 
print(plot_combined)

#B: Poverty Rate vs Night Lights 

#Madhya Pradesh

plot3 <- mp_train %>%
  ggplot(aes(x = viirs_annual_mean, y = secc_pov_rate_rural)) +
  geom_point(alpha = 0.25) + 
  scale_x_continuous(
    name = "Average Night Light Radiance (nW/cm^2/sr)"
  ) +
  scale_y_continuous(
    name = "% of Shrid Households in Poverty"
  ) +
  labs(title = "Madhya Pradesh") +

  theme_minimal() +
  theme(
    axis.title = element_text(size = 8), 
    axis.text = element_text(size = 8),
    plot.title = element_text(size = 10)
  )

#Assam
plot4 <- as_train %>%
   ggplot(aes(x = viirs_annual_mean, y = secc_pov_rate_rural)) + 
  geom_point(alpha = 0.25) +
  scale_x_continuous(
    name = "Average Night Light Radiance (nW/cm^2/sr)"
  ) +
  scale_y_continuous(
    name = "% of Shrid Households in Poverty"
  ) +
   labs(title = "Assam") +
  theme_minimal() + 
  theme(
    axis.title = element_text(size = 8), 
    axis.text = element_text(size = 8),
    plot.title = element_text(size = 10)
  )

#side-by-side
plot_combined <- plot3 + plot4  +
  plot_annotation(title = "Shrid Poverty Rate vs. Night Light")
plot_combined

#C: Poverty Rate vs. Middle School Education 

#Madhya Pradesh
plot5 <- mp_train %>%
  ggplot(aes(x = ed_mid_share, y = secc_pov_rate_rural)) +
  geom_smooth(alpha = 0.5) +
  labs(title = "Madhya Pradesh") +
  xlab("Share of Rural Population with Middle School Education") +
  ylab("Percent of Rural Households Below Poverty Line") +
  theme_minimal()  + 
  theme(
    axis.title = element_text(size = 8), 
    axis.text = element_text(size = 8),
    plot.title = element_text(size = 10)
  )

#Assam
plot6 <- as_train %>%
  ggplot(aes(x = ed_mid_share, y = secc_pov_rate_rural)) +
  geom_smooth(alpha = 0.5) +
  labs(title = "Assam") +
  xlab("Share of Rural Population with Middle School Education") +
  ylab("Percent of Rural Households Below Poverty Line") +
  theme_minimal()  + 
  theme(
    axis.title = element_text(size = 8), 
    axis.text = element_text(size = 8),
    plot.title = element_text(size = 10)
  )

#side-by-side
plot_combined <- plot5 + plot6 +
  plot_annotation(title = "Middle School Graduate Rate vs. Poverty Rate")
plot_combined

#D: Poverty Rate by Proportion of Population that is SC/ST

#create SC/ST proportion variable -- create duplicate train_EDA so as to not add extra columns to training df which are not present in test df
mp_train_EDA <- mp_train %>%
  mutate(sched_prop = (pc11_pca_p_sc + pc11_pca_p_st) / pc11_pca_tot_p)
as_train_EDA <- as_train %>%
  mutate(sched_prop = (pc11_pca_p_sc + pc11_pca_p_st) / pc11_pca_tot_p)

#Madhya Pradesh
plot7 <- mp_train_EDA %>%
ggplot(aes(x = sched_prop, y = secc_pov_rate_rural)) +
geom_smooth(alpha = 0.5) +
theme_linedraw() +
labs(
  x = "% Scheduled Caste/Tribe",
  y = "Percent of Rural Households Below Poverty Line",
  title = "Madhya Pradesh" 
)  + 
  theme(
    axis.title = element_text(size = 8), 
    axis.text = element_text(size = 8),
    plot.title = element_text(size = 10)
  ) +
  theme_minimal()

#Assam
plot8 <- as_train_EDA %>%
ggplot(aes(x = sched_prop, y = secc_pov_rate_rural)) +
geom_smooth(alpha = 0.5) +
theme_linedraw() +
labs(
  x = "% Scheduled Caste/Tribe",
  y = "Percent of Rural Households Below Poverty Line",
  title = "Assam") + 
  theme(
    axis.title = element_text(size = 8), 
    axis.text = element_text(size = 8),
    plot.title = element_text(size = 10)
  ) +
  theme_minimal()

#side-by-side
plot_combined <- plot7 + plot8 +
  plot_annotation(title = "Scheduled Caste/Tribe Population vs. Poverty Rate")
plot_combined

#E: Poverty Rate by Literacy Rate
#create total literacy rate variable
mp_train_EDA <- mp_train %>%
  mutate(lit_rate = (pc11_pca_p_lit) / pc11_pca_tot_p)
as_train_EDA <- as_train %>%
  mutate(lit_rate = (pc11_pca_p_lit) / pc11_pca_tot_p)

#Madhya Pradesh
plot9 <- mp_train_EDA %>%
ggplot(aes(x = lit_rate, y = secc_pov_rate_rural)) +
geom_smooth(alpha = 0.5) +
theme_linedraw() +
labs(
  x = "Literate Proportion of Population",
  y = "Poverty Rate",
  title = "Madhya Pradesh") + 
  theme(
    axis.title = element_text(size = 8), 
    axis.text = element_text(size = 8),
    plot.title = element_text(size = 10)
  ) +
  theme_minimal()

#Assam
plot10 <- as_train_EDA %>%
ggplot(aes(x = lit_rate, y = secc_pov_rate_rural)) +
geom_smooth(alpha = 0.5) +
theme_linedraw() +
labs(
  x = "Literate Proportion of Population",
  y = "Poverty Rate",
  title = "Assam") + 
  theme(
    axis.title = element_text(size = 8), 
    axis.text = element_text(size = 8),
    plot.title = element_text(size = 10)
  ) +
  theme_minimal()

#side-by-side
plot_combined <- plot9 + plot10 +
  plot_annotation(title = "Literacy Rate vs. Poverty Rate")
plot_combined

#F: Poverty Rate by Proportion of Agricultural Employment

#create var for proportion of workers who are cultivators
mp_train_EDA <- mp_train %>%
  mutate(ag_prop = pc11_pca_main_cl_p/pc11_pca_tot_work_p)
as_train_EDA <- as_train %>%
  mutate(ag_prop = pc11_pca_main_cl_p/pc11_pca_tot_work_p)

#Madhya Pradesh
plot11 <- mp_train_EDA %>%
ggplot(aes(x = ag_prop, y = secc_pov_rate_rural)) +
geom_smooth(alpha = 0.5) +
theme_linedraw() +
labs(
  x = "Proportion of Workers who are Cultivators",
  y = "Poverty Rate",
  title = "Madhya Pradesh") + 
  theme(
    axis.title = element_text(size = 8), 
    axis.text = element_text(size = 8),
    plot.title = element_text(size = 10)
  ) +
  theme_minimal()

#Assam
plot12 <- as_train_EDA %>%
ggplot(aes(x = ag_prop, y = secc_pov_rate_rural)) +
geom_smooth(alpha = 0.5) +
theme_linedraw() +
labs(
  x = "Proportion of Workers who are Cultivators",
  y = "Poverty Rate",
  title = "Assam") + 
  theme(
    axis.title = element_text(size = 8), 
    axis.text = element_text(size = 8),
    plot.title = element_text(size = 10)
  ) +
  theme_minimal()

#side-by-side
plot_combined <- plot11 + plot12  +
  plot_annotation(title = "Agricultural Employment vs. Poverty Rate")
plot_combined

GEOSPATIAL ANALYSIS

As we can see from the graph of India below, rural poverty rates appear to be especially high in Northeast India, where Assam is located.

The maps of Assam and Madhya Pradesh illustrate how there is internal variation in rural poverty rates within states. For example, Assam is located in a high-poverty area and yet there are some shrids within it with relatively low poverty rates.

[India (overall): Geospatial Analysis]

#Reading in shrid shape files
shrid2_open_cpg_drive <- drive_download(file = "shrid2_open_ja.cpg", overwrite = TRUE)
shrid2_open_dbf_drive <- drive_download(file = "shrid2_open_ja.dbf", overwrite = TRUE)
shrid2_open_prj_drive <- drive_download(file = "shrid2_open_ja.prj", overwrite = TRUE)
shrid2_open_shp_drive <- drive_download(file = "shrid2_open_ja.shp", overwrite = TRUE)
shrid2_open_shx_drive <- drive_download(file = "shrid2_open_ja.shx", overwrite = TRUE)
#Reading in shape file and perform
  #a data transformation
shrid <- st_read("shrid2_open_ja.shp", quiet = TRUE)
shrid <- st_transform(shrid, crs = 4326)

#Reading in shape file
shrid_v2 <- readOGR("shrid2_open_ja.shp")
OGR data source with driver: ESRI Shapefile 
Source: "/Users/jacobausubelold/Desktop/Final project PPOL-6803/FinalGroupProject/shrid2_open_ja.shp", layer: "shrid2_open_ja"
with 576153 features
It has 7 fields
#Merging together SECC consumption data
#and shrid boundaries
secc_merged <- merge(secc_cons, shrid, by = c("shrid2"))
india_map <- merge(shrid_v2, secc_merged, by = c("shrid2"))

#Converting data
india_map <- st_as_sf(india_map)

#Removing missing data from poverty rate column
india_map <- india_map %>%
  filter(!is.na(secc_pov_rate_rural))

#Continuing to transform data
#into form in which we can map into
india_map <- st_transform(india_map, crs = 4326)
#Creating map of rural poverty in India
india_map %>% 
  ggplot() +
  geom_sf(color = NA, aes(fill = secc_pov_rate_rural)) +
  labs(title = "Mapping poverty in India",
       fill = "Rural poverty rate")

[Madhya Pradesh: Geospatial Analysis]

#Creating a map like the previous one but filtering
  #to just shrids in MP
india_map %>%
  filter(SS == 23) %>% 
  ggplot() +
  geom_sf(color = NA, aes(fill = secc_pov_rate_rural)) +
  labs(title = "Mapping poverty in Madhya Pradesh",
       fill = "Rural poverty rate")

[Assam: Geospatial Analysis]

#Creating a map like the previous one but filtering
  #to just shrids in Assam
india_map %>%
  filter(SS == 18) %>% 
  ggplot() +
  geom_sf(color = NA, aes(fill = secc_pov_rate_rural)) +
  labs(title = "Mapping poverty in Assam",
       fill = "Rural poverty rate")

#Removing files for shape files
file.remove("shrid2_open_ja.cpg")
[1] TRUE
file.remove("shrid2_open_ja.dbf")
[1] TRUE
file.remove("shrid2_open_ja.prj")
[1] TRUE
file.remove("shrid2_open_ja.shp")
[1] TRUE
file.remove("shrid2_open_ja.shx")
[1] TRUE

DISCUSSION OF RESULTS

  1. Regression

Our team created linear regression and LASSO models to predict rural poverty rates in Assam and Madhya Pradesh. In the next couple of paragraphs, we outline the process used to create the models. We also explain how the models can be used to explain the importance (or lack thereof) of night light data in predicting rural poverty rates.

As you will soon see, night lights appear to be more predictive of rural poverty rates in Madhya Pradesh than in Assam.

A. Assam

We created linear regression and LASSO models to predict rural poverty rates in Assam. Both models use 10-fold cross validation. For both linear regression and LASSO, we selected the models with the smallest RMSE.

The best linear regression model (i.e. the one with the lowest RMSE) is Preprocessor 1 Model 1, with a RMSE of 0.1203944. Meanwhile, the best LASSO model (i.e. the one with the lowest RMSE) is Preprocessor 1 Model 7, with a RMSE of 0.1199653. The best LASSO model includes a penalty of 4.641589e-04.

For the linear regression model, we demonstrated how to get estimated coefficients for viirs_annual_mean, i.e. the night light variable. There are 10 different estimates, one for each of the folds. Most of the coefficients are somewhere between -0.00055 and -0.00070 and have p-values greater than 0.05. We conclude that the effect of night lights in the linear regression models for Assam is not statistically significant.

For the LASSO model, we demonstrated how to find the importance of viirs_annual_mean, i.e. the night light variable. For importance, it has a coefficient of 0.0002283624 and a negative sign. The very small coefficient provides further evidence that night light is not very predictive of rural poverty rates in Assam.

We feel that the LASSO model should be our final model for predicting poverty in Assam. The reason is that it has a slightly smaller RMSE than the linear regression model (0.1199653 vs. 0.1203944).

[Regression]

[Assam]

# Filter only numeric columns, dropping missing values
as_train <- as_train %>%
  select_if(is.numeric) %>%
  na.omit()

as_test <- as_test %>%
  select_if(is.numeric) %>%
  na.omit()
#Dropping a few more variables
as_train <- as_train %>%
  select(-c(viirs_annual_min, viirs_annual_max, viirs_annual_sum,
    viirs_annual_num_cells, secc_cons_pc_rural,
    secc_cons_pc_rural, secc_pov_rate_tend_rural, secc_cons_rural, secc_hh, 
    `_mean_p_miss.x`, `_mean_p_miss.y`, `_mean_p_miss`))
as_test <- as_test %>%
  select(-c(viirs_annual_min, viirs_annual_max, viirs_annual_sum,
    viirs_annual_num_cells, secc_cons_pc_rural,
    secc_cons_pc_rural, secc_pov_rate_tend_rural, secc_cons_rural, secc_hh,
    `_mean_p_miss.x`, `_mean_p_miss.y`, `_mean_p_miss`))
# Assam: Since lasso was giving error message for vars with 0 variance, this is to check which vars have 0 variance

#near_zero_vars <- nearZeroVar(as_train, saveMetrics = TRUE)
#print(near_zero_vars)

#near_zero_var_names <- names(as_train)[near_zero_vars$zeroVar]
#print(near_zero_var_names)

# dropping some columns that are not useful and were not working with step_scale()
#as_train <-
  #as_train %>%
  #select(-c(category, SS.x, `_target_weight_share.x`, YY.y, bond_lab_share, `_core_p_miss.y`, `_target_group_max_weight_share.y`, SS.x.x, `_target_group_max_weight_share`, SS.y.y, YY.x, `_core_p_miss.x`, `_target_group_max_weight_share.x`, SS.y, scav_share, `_target_weight_share.y`, YY.x.x, `_target_weight_share`, YY.y.y, year, shrid2.x, shrid2.x.x, shrid2.y, shrid2.y.y, TTTTTT))

#as_test <-
  #as_test %>%
  #select(-c(category, SS.x, `_target_weight_share.x`, YY.y, bond_lab_share, `_core_p_miss.y`, `_target_group_max_weight_share.y`, SS.x.x, `_target_group_max_weight_share`, SS.y.y, YY.x, `_core_p_miss.x`, `_target_group_max_weight_share.x`, SS.y, scav_share, `_target_weight_share.y`, YY.x.x, `_target_weight_share`, YY.y.y, year, shrid2.x, shrid2.x.x, shrid2.y, shrid2.y.y, TTTTTT))
#Coming up with a recipe for all the Assam models
as_rec <- recipe(formula = secc_pov_rate_rural ~ ., data = as_train) %>%
  step_zv(all_predictors()) %>%
  step_select(all_numeric()) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) %>%
  prep()
#setting seed and v-fold cross-validation
set.seed(04281998)
folds <- vfold_cv(data = as_train, v = 10, repeats = 1) 
[Model 1: Linear Regression]
#create model object
lm_mod <- linear_reg() %>%
  set_engine("lm")

# create a workflow
lm_wf <- workflow() %>%
  add_recipe(as_rec) %>%
  add_model(lm_mod) 

#function needed to extract coefficients
get_lm_coefs <- function(x) {
  
  x %>% 
    extract_fit_engine() %>% 
    tidy()
}

tidy_ctrl <- control_grid(extract = get_lm_coefs)

#fit 
lm_cv <- lm_wf %>%
  fit_resamples(resamples = folds, control = tidy_ctrl)

#measure RMSE 
collect_metrics(lm_cv, summarize = FALSE) %>%
 filter(.metric == "rmse") %>%
 ggplot(aes(id, .estimate, group = .estimator)) +
 geom_line() +
 geom_point() +
 scale_y_continuous(limits = c(0, 0.5)) +
 labs(title = "Calculated RMSE Across the 10 Folds",
 y = "RMSE_hat") +
 theme_minimal()

#collecting metrics
metric_table_lm <- collect_metrics(lm_cv, summarize = TRUE)
metric_table_lm %>%
  filter(.metric == "rmse")
# A tibble: 1 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 rmse    standard   0.120     8 0.00153 Preprocessor1_Model1
#Getting coefficients
lm_coefs <- lm_cv %>% 
  select(id, .extracts) %>% 
  unnest(.extracts)  %>% 
  unnest(.extracts)

#Getting coefficients for night light
lm_coefs %>%
  filter(term == "viirs_annual_mean")
# A tibble: 8 × 7
  id     term               estimate std.error statistic p.value .config        
  <chr>  <chr>                 <dbl>     <dbl>     <dbl>   <dbl> <chr>          
1 Fold01 viirs_annual_mean -0.000690   0.00104    -0.665   0.506 Preprocessor1_…
2 Fold02 viirs_annual_mean -0.000652   0.00104    -0.626   0.531 Preprocessor1_…
3 Fold03 viirs_annual_mean -0.000590   0.00103    -0.572   0.567 Preprocessor1_…
4 Fold05 viirs_annual_mean -0.000396   0.00104    -0.380   0.704 Preprocessor1_…
5 Fold06 viirs_annual_mean -0.000574   0.00103    -0.555   0.579 Preprocessor1_…
6 Fold07 viirs_annual_mean -0.000682   0.00104    -0.656   0.512 Preprocessor1_…
7 Fold08 viirs_annual_mean -0.000568   0.00104    -0.546   0.585 Preprocessor1_…
8 Fold09 viirs_annual_mean -0.000589   0.00104    -0.567   0.571 Preprocessor1_…
[Model 2: LASSO]
#Lasso model
lasso_grid <- grid_regular(penalty(), levels = 10)

lasso_mod <- linear_reg(
  penalty = tune(), 
  mixture = 1,
) %>%
  set_engine("glmnet")

#creating wf
lasso_wf <- workflow() %>%
  add_recipe(as_rec) %>%
  add_model(lasso_mod) 

#fitting wf 
lasso_cv <- lasso_wf %>%
  tune_grid(
    resamples = folds,
    grid = lasso_grid)

#measure RMSE 
collect_metrics(lasso_cv, summarize = FALSE) %>%
 filter(.metric == "rmse") %>%
 ggplot(aes(id, .estimate, group = .estimator)) +
 geom_line() +
 geom_point() +
 scale_y_continuous(limits = c(0, 0.5)) +
 labs(title = "Calculated RMSE Across the 10 Folds",
 y = "RMSE_hat") +
 theme_minimal()

#collecting metrics
metric_table_lasso <- collect_metrics(lasso_cv, summarize = TRUE)
metric_table_lasso %>%
  filter(.metric == "rmse")
# A tibble: 10 × 7
         penalty .metric .estimator  mean     n std_err .config              
           <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
 1 0.0000000001  rmse    standard   0.120     8 0.00151 Preprocessor1_Model01
 2 0.00000000129 rmse    standard   0.120     8 0.00151 Preprocessor1_Model02
 3 0.0000000167  rmse    standard   0.120     8 0.00151 Preprocessor1_Model03
 4 0.000000215   rmse    standard   0.120     8 0.00151 Preprocessor1_Model04
 5 0.00000278    rmse    standard   0.120     8 0.00151 Preprocessor1_Model05
 6 0.0000359     rmse    standard   0.120     8 0.00149 Preprocessor1_Model06
 7 0.000464      rmse    standard   0.120     8 0.00118 Preprocessor1_Model07
 8 0.00599       rmse    standard   0.125     8 0.00119 Preprocessor1_Model08
 9 0.0774        rmse    standard   0.172     8 0.00110 Preprocessor1_Model09
10 1             rmse    standard   0.198     8 0.00105 Preprocessor1_Model10
#selecting model with lowest RMSE
lasso_best <- lasso_cv %>%
  select_best(metric = "rmse")

#finalizing first wf 
lasso_final <- finalize_workflow(
  lasso_wf,
  parameters = lasso_best
)

lasso_coefs <- lasso_final %>%
  fit(data = as_train) %>%
  extract_fit_parsnip() %>%
  vi(lambda = lasso_best$penalty) 

#Importance of night light variable 
lasso_coefs %>%
  filter(Variable == "viirs_annual_mean") 
# A tibble: 1 × 3
  Variable          Importance Sign 
  <chr>                  <dbl> <chr>
1 viirs_annual_mean   0.000228 NEG  
[Assam: Determining best model]
bind_rows(
  `lm` = show_best(lm_cv, metric = "rmse", n = 1),
  `lasso` = show_best(lasso_cv, metric = "rmse", n = 1),
  .id = "model")
# A tibble: 2 × 8
  model .metric .estimator  mean     n std_err .config                 penalty
  <chr> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                     <dbl>
1 lm    rmse    standard   0.120     8 0.00153 Preprocessor1_Model1  NA       
2 lasso rmse    standard   0.120     8 0.00118 Preprocessor1_Model07  0.000464
#fitting LASSO on testing data 
#lasso_fit <- lasso_final %>%
#  fit(data = as_test)

#create predictions df
#predictions_lasso <-
#  bind_cols(
#    test,
#    stats::predict(object = lasso_fit, new_data = as_test)
#  )

# calculate the rmse on the testing data: 
#rmse(data = predictions_lasso, truth = secc_pov_rate_rural, estimate = .pred)

B. Madhya Pradesh

We created linear regression and LASSO models to predict rural poverty rates in Madhya Pradesh. Both models use 10-fold cross validation. For both linear regression and LASSO, we selected the models with the smallest RMSE.

The best linear regression model (i.e. the one with the lowest RMSE) is Preprocessor 1 Model 1, with a RMSE of 0.1269577. Meanwhile, the best LASSO model (i.e. the one with the lowest RMSE) is Preprocessor 1 Model 6, with a RMSE of 0.1267754. The best LASSO model includes a penalty of 3.593814e-05.

For the linear regression model, we demonstrated how to get estimated coefficients for viirs_annual_mean, i.e. the night light variable. There are 10 different estimates, one for each of the folds. Most of the coefficients are somewhere between -0.0035 and -0.0045 and have p-values smaller than 0.05.

We conclude that the effect of night lights in the linear regression models for Madhya Pradesh are statistically significant. However, the coefficients are relatively small and so there is reason to think that the results are not substantively significant. In other words, night lights might not be especially predictive of rural poverty rates in Madhya Pradesh, even if results happen to be statistically significant.

For the LASSO model, we demonstrated how to find the importance of viirs_annual_mean, i.e. the night light variable. For importance, it has a coefficient of 0.0005925927 and a negative sign. The very small coefficient provides further evidence that night light is not very predictive of rural poverty rates in Madhya Pradesh.

We feel that the LASSO model should be our final model for predicting poverty in Madhya Pradesh, since the LASSO model has a slightly smaller RMSE than the linear regression model (0.1267754 vs. 0.1269577).

[Madhya Pradesh]

# Filter only numeric columns, dropping missing values
mp_train <- mp_train %>%
  select_if(is.numeric) %>%
  na.omit()

mp_test <- mp_test %>%
  select_if(is.numeric) %>%
  na.omit()
#Dropping a few more variables
mp_train <- mp_train %>%
  select(-c(viirs_annual_min, viirs_annual_max, viirs_annual_sum,
    viirs_annual_num_cells, secc_cons_pc_rural,
    secc_cons_pc_rural, secc_pov_rate_tend_rural, secc_cons_rural, secc_hh, 
    `_mean_p_miss.x`, `_mean_p_miss.y`, `_mean_p_miss`))
mp_test <- mp_test %>%
  select(-c(viirs_annual_min, viirs_annual_max, viirs_annual_sum,
    viirs_annual_num_cells, secc_cons_pc_rural,
    secc_cons_pc_rural, secc_pov_rate_tend_rural, secc_cons_rural, secc_hh,
    `_mean_p_miss.x`, `_mean_p_miss.y`, `_mean_p_miss`))
#Coming up with a recipe for all the Madhya Pradesh models
mp_rec <- recipe(formula = secc_pov_rate_rural ~ ., data = mp_train) %>%
  step_zv(all_predictors()) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) %>%
  prep()
#setting seed and v-fold cross-validation
set.seed(04281998)
folds <- vfold_cv(data = mp_train, v = 10, repeats = 1) 
[Model 1: Linear Regression]
#create model object
lm_mod <- linear_reg() %>%
  set_engine("lm")

# create a workflow
lm_wf <- workflow() %>%
  add_recipe(mp_rec) %>%
  add_model(lm_mod) 

#function needed to extract coefficients
get_lm_coefs <- function(x) {
  
  x %>% 
    extract_fit_engine() %>% 
    tidy()
  
}

tidy_ctrl <- control_grid(extract = get_lm_coefs)

#fit 
lm_cv <- lm_wf %>%
  fit_resamples(resamples = folds, control = tidy_ctrl)

#measure RMSE 
collect_metrics(lm_cv, summarize = FALSE) %>%
 filter(.metric == "rmse") %>%
 ggplot(aes(id, .estimate, group = .estimator)) +
 geom_line() +
 geom_point() +
 scale_y_continuous(limits = c(0, 0.5)) +
 labs(title = "Calculated RMSE Across the 10 Folds",
 y = "RMSE_hat") +
 theme_minimal()

#collecting metrics
metric_table_lm <- collect_metrics(lm_cv, summarize = TRUE)
metric_table_lm %>%
  filter(.metric == "rmse")
# A tibble: 1 × 6
  .metric .estimator  mean     n  std_err .config             
  <chr>   <chr>      <dbl> <int>    <dbl> <chr>               
1 rmse    standard   0.127    10 0.000611 Preprocessor1_Model1
#Getting coefficients
lm_coefs <- lm_cv %>% 
  select(id, .extracts) %>% 
  unnest(.extracts)  %>% 
  unnest(.extracts)

#Getting coefficients for night light
lm_coefs %>%
  filter(term == "viirs_annual_mean")
# A tibble: 10 × 7
   id     term              estimate std.error statistic      p.value .config   
   <chr>  <chr>                <dbl>     <dbl>     <dbl>        <dbl> <chr>     
 1 Fold01 viirs_annual_mean -0.00406  0.000805     -5.05 0.000000455  Preproces…
 2 Fold02 viirs_annual_mean -0.00438  0.000810     -5.41 0.0000000643 Preproces…
 3 Fold03 viirs_annual_mean -0.00424  0.000811     -5.22 0.000000176  Preproces…
 4 Fold04 viirs_annual_mean -0.00358  0.000803     -4.45 0.00000853   Preproces…
 5 Fold05 viirs_annual_mean -0.00432  0.000818     -5.28 0.000000131  Preproces…
 6 Fold06 viirs_annual_mean -0.00452  0.000806     -5.61 0.0000000208 Preproces…
 7 Fold07 viirs_annual_mean -0.00420  0.000807     -5.21 0.000000190  Preproces…
 8 Fold08 viirs_annual_mean -0.00390  0.000808     -4.83 0.00000137   Preproces…
 9 Fold09 viirs_annual_mean -0.00440  0.000809     -5.44 0.0000000523 Preproces…
10 Fold10 viirs_annual_mean -0.00349  0.000819     -4.27 0.0000200    Preproces…
[Model 2: LASSO]
#Lasso model
lasso_grid <- grid_regular(penalty(), levels = 10)

lasso_mod <- linear_reg(
  penalty = tune(), 
  mixture = 1,
) %>%
  set_engine("glmnet")

#creating wf
lasso_wf <- workflow() %>%
  add_recipe(mp_rec) %>%
  add_model(lasso_mod) 

#fitting wf 
lasso_cv <- lasso_wf %>%
  tune_grid(
    resamples = folds,
    grid = lasso_grid)

#measure RMSE 
collect_metrics(lasso_cv, summarize = FALSE) %>%
 filter(.metric == "rmse") %>%
 ggplot(aes(id, .estimate, group = .estimator)) +
 geom_line() +
 geom_point() +
 scale_y_continuous(limits = c(0, 0.5)) +
 labs(title = "Calculated RMSE Across the 10 Folds",
 y = "RMSE_hat") +
 theme_minimal()

#collecting metrics
metric_table_lasso <- collect_metrics(lasso_cv, summarize = TRUE)
metric_table_lasso %>%
  filter(.metric == "rmse")
# A tibble: 10 × 7
         penalty .metric .estimator  mean     n  std_err .config              
           <dbl> <chr>   <chr>      <dbl> <int>    <dbl> <chr>                
 1 0.0000000001  rmse    standard   0.127    10 0.000615 Preprocessor1_Model01
 2 0.00000000129 rmse    standard   0.127    10 0.000615 Preprocessor1_Model02
 3 0.0000000167  rmse    standard   0.127    10 0.000615 Preprocessor1_Model03
 4 0.000000215   rmse    standard   0.127    10 0.000615 Preprocessor1_Model04
 5 0.00000278    rmse    standard   0.127    10 0.000615 Preprocessor1_Model05
 6 0.0000359     rmse    standard   0.127    10 0.000617 Preprocessor1_Model06
 7 0.000464      rmse    standard   0.127    10 0.000669 Preprocessor1_Model07
 8 0.00599       rmse    standard   0.134    10 0.000794 Preprocessor1_Model08
 9 0.0774        rmse    standard   0.183    10 0.000686 Preprocessor1_Model09
10 1             rmse    standard   0.217    10 0.000679 Preprocessor1_Model10
#selecting model with lowest RMSE
lasso_best <- lasso_cv %>%
  select_best(metric = "rmse")

#finalizing first wf 
lasso_final <- finalize_workflow(
  lasso_wf,
  parameters = lasso_best
)

lasso_coefs <- lasso_final %>%
  fit(data = as_train) %>%
  extract_fit_parsnip() %>%
  vi(lambda = lasso_best$penalty) 

#Importance of night light variable
lasso_coefs %>%
  filter(Variable == "viirs_annual_mean") 
# A tibble: 1 × 3
  Variable          Importance Sign 
  <chr>                  <dbl> <chr>
1 viirs_annual_mean   0.000593 NEG  
[Determining best model]
bind_rows(
  `lm` = show_best(lm_cv, metric = "rmse", n = 1),
  `lasso` = show_best(lasso_cv, metric = "rmse", n = 1),
  .id = "model")
# A tibble: 2 × 8
  model .metric .estimator  mean     n  std_err .config                  penalty
  <chr> <chr>   <chr>      <dbl> <int>    <dbl> <chr>                      <dbl>
1 lm    rmse    standard   0.127    10 0.000611 Preprocessor1_Model1  NA        
2 lasso rmse    standard   0.127    10 0.000617 Preprocessor1_Model06  0.0000359
#fitting LASSO on testing data 
#lasso_fit <- lasso_final %>%
#  fit(data = as_test)

#create predictions df
#predictions_lasso <-
#  bind_cols(
#    test,
#    stats::predict(object = lasso_fit, new_data = as_test)
#  )

# calculate the rmse on the testing data: 
#rmse(data = predictions_lasso, truth = secc_pov_rate_rural, estimate = .pred)
  1. Comparison between Assam and Madhya Pradesh

One difference between the two states is that the OLS coefficients for night lights are statistically significant for Madhya Pradesh but are not statistically significant for Assam. This suggests that night lights might be more predictive of rural poverty rates in Madhya Pradesh than in Assam.

  1. Classification

Our team created decision tree models to predict shrids with rural poverty rates of 50% or greater in Assam and Madhya Pradesh. In the next couple of paragraphs, we outline the process used to create the models. We also explain how the models can be used to explain the importance (or lack thereof) of night light data in predicting rural poverty rates.

The first step was to create a binary variable in both the Assam and Madhya Pradesh data. The variable was assigned a value of “Higher” if the rural poverty rate was 50% or greater and a value of “Lower” if the rural poverty rate was below 50%.

For both the Assam and Madhya Pradesh decision trees, we used night light and the share of households that own land as predictors. We deliberately did not want to include more predictors, so as to more easily assess the effect of night light in decision tree models.

  1. Assam

The decision tree model for Assam has an accuracy rate of 71%, which is relatively good. However, both the precision rate (56%) and the recall rate (23%) are relatively poor. The results suggest that the model does a bad job of classifying shrids into “higher” and “lower” poverty categories. Like the regression results, the decision tree results suggest that night lights are not very predictive of rural poverty rates in Assam.

[Classification] [Assam]

#Creating binary variable
#1: 50%+ of rural households in poverty, 0: under 50% of rural households in poverty
as_train$high_poverty <-
  ifelse(as_train$secc_pov_rate_rural >= 0.5, "Higher", "Lower")
as_test$high_poverty <-
  ifelse(as_test$secc_pov_rate_rural >= 0.5, "Higher", "Lower")
#Making high_poverty variable factors
as_train$high_poverty <- as.factor(as_train$high_poverty)
as_test$high_poverty <- as.factor(as_test$high_poverty)
# create a recipe
cart_rec_as <-
  recipe(formula = high_poverty ~ viirs_annual_mean + land_own_share, data = as_train)
# create a cart model object
cart_mod_classification_as <-
  decision_tree() %>%
  set_engine(engine = "rpart") %>%
  set_mode(mode = "classification")
cart_wf_classification_as <- workflow() %>%
  add_recipe(cart_rec_as) %>%
  add_model(cart_mod_classification_as)
# fit the model
cart_fit_classification_as <- 
  cart_wf_classification_as %>%
  fit(data = as_train)
#Visualizing decision tree
rpart.plot::rpart.plot(x = cart_fit_classification_as$fit$fit$fit)

#Adding predicted values and probabilities
  #as columns to testing dataset
predictions_as <- bind_cols(
  as_test,
  predict(object = cart_fit_classification_as, new_data = as_test),
  predict(object = cart_fit_classification_as, new_data = as_test, type = "prob")
)
#Creating confusion matrix
conf_mat(data = predictions_as,
truth = high_poverty,
estimate = .pred_class)
          Truth
Prediction Higher Lower
    Higher    447   352
    Lower    1478  4069
#Calculating accuracy
accuracy(data = predictions_as,
truth = high_poverty,
estimate = .pred_class)
# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.712

Accuracy: 71%

#Calculating precision
precision_vec(data = predictions_as,
truth = predictions_as$high_poverty,
estimate = predictions_as$.pred_class)
[1] 0.5594493

Precision: 56%

#Calculating recall
recall_vec(data = predictions_as,
truth = predictions_as$high_poverty,
estimate = predictions_as$.pred_class)
[1] 0.2322078

Recall: 23%

  1. Madhya Pradesh

The decision tree model for Madhya Pradesh has an accuracy rate of 71% and a precision rate of 64%, which are decent but not great results. However, it only has a recall rate of 57%.

Importantly, however, the Madhya Pradesh decision tree is better than the Assam decision tree at classifying shrids into “higher” and “lower” poverty categories. Perhaps night lights are more predictive of rural poverty rates in Madhya Pradesh than in Assam (as the regression results also suggest).

[Madhya Pradesh]

#Creating binary variable
#1: 50%+ of rural households in poverty, 0: under 50% of rural households in poverty
mp_train$high_poverty <-
  ifelse(mp_train$secc_pov_rate_rural >= 0.5, "Higher", "Lower")
mp_test$high_poverty <-
  ifelse(mp_test$secc_pov_rate_rural >= 0.5, "Higher", "Lower")
#Making high poverty column factors
mp_train$high_poverty <- as.factor(mp_train$high_poverty)
mp_test$high_poverty <- as.factor(mp_test$high_poverty)
# create a recipe
cart_rec_mp <-
  recipe(formula = high_poverty ~ viirs_annual_mean + land_own_share, data = mp_train)
# create a cart model object
cart_mod_classification_mp <-
  decision_tree() %>%
  set_engine(engine = "rpart") %>%
  set_mode(mode = "classification")
cart_wf_classification_mp <- workflow() %>%
  add_recipe(cart_rec_mp) %>%
  add_model(cart_mod_classification_mp)
# fit the model
cart_fit_classification_mp <- 
  cart_wf_classification_mp %>%
  fit(data = mp_train)
#Visualizing decision tree
rpart.plot::rpart.plot(x = cart_fit_classification_mp$fit$fit$fit)

#Adding predicted values and probabilities
  #as columns to testing dataset
predictions_mp <- bind_cols(
  mp_test,
  predict(object = cart_fit_classification_mp, new_data = mp_test),
  predict(object = cart_fit_classification_mp, new_data = mp_test, type = "prob")
)
#Creating decision tree
conf_mat(data = predictions_mp,
truth = high_poverty,
estimate = .pred_class)
          Truth
Prediction Higher Lower
    Higher   3410  1910
    Lower    2600  6144
#Calculating accuracy
accuracy(data = predictions_mp,
truth = high_poverty,
estimate = .pred_class)
# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.679

Accuracy: 68%

#Calculating precision
precision_vec(data = predictions_mp,
truth = predictions_mp$high_poverty,
estimate = predictions_mp$.pred_class)
[1] 0.6409774

Precision: 64%

recall_vec(data = predictions_mp,
truth = predictions_mp$high_poverty,
estimate = predictions_mp$.pred_class)
[1] 0.5673877

Recall: 57%

[Appendix] Assam: Model 1 - CART

#coming up with a cart mod
cart_mod <-
  decision_tree() %>%
  set_engine(engine = "rpart", model = TRUE) %>%
  set_mode(mode = "regression")

#Creating workflow 
cart_wf <- workflow() %>%
  add_recipe(as_rec) %>%
  add_model(cart_mod)

#Fitting the model
cart_fit <- cart_wf %>%
  fit(data = as_train)

#Creating a tree
tree_model <- rpart.plot::rpart.plot(x = cart_fit$fit$fit$fit, roundint = FALSE)

#Plotting Variable Importance Plot (VIP)
cart_fit %>%
extract_fit_parsnip() %>%
vip(num_features = 10)

MP: Model 1 - CART

#coming up with a cart mod
cart_mod <-
  decision_tree() %>%
  set_engine(engine = "rpart") %>%
  set_mode(mode = "regression")

#Creating workflow 
cart_wf <- workflow() %>%
  add_recipe(mp_rec) %>%
  add_model(cart_mod)

#Fitting the model
cart_fit <- cart_wf %>%
  fit(data = mp_train)

#Creating a tree
tree_model <- rpart.plot::rpart.plot(x = cart_fit$fit$fit$fit)

#plotting vip
cart_fit %>%
extract_fit_parsnip() %>%
vip(num_features = 10)

CONCLUSION

In conclusion, night lights are a better predictor of rural poverty rates and categorization in Madhya Pradesh than in Assam. However, it is not a great predictor in even the Madhya Pradesh models.

One limitation of our analysis is that it just looks at two states in India. Perhaps there are states in which night light is a better predictor of rural poverty than in either Assam or Madhya Pradesh. Another shortcoming is that we are limited to a single measure of rural poverty (i.e., the share of rural households consuming less than 31 INR per day in 2012).

One area for future research is to use night lights to predict other measures of rural poverty and/or consumption. We might want to look at the whole country of India, not just two states.

WORKS CITED

Asher, Sam, et al. “Development research at high geographic resolution: an analysis of night-lights, firms, and poverty in India using the shrug open data platform.” The World Bank Economic Review 35.4 (2021): 845-871.

Chandrasekhar, S., Ajay Sharma, and Kathikeya Naraparaju (2021). Spatial disparities in household earnings in India. Ideas for India. https://www.ideasforindia.in/topics/poverty-inequality/spatial-disparities-in-household-earnings-in-india.html

Elbers, Chris, Jean O. Lanjouw, and Peter Lanjouw. “Micro-level estimation of poverty and inequality.” Econometrica 71.1 (2003): 355-364.

Elvidge, Christopher D., et al. “A global poverty map derived from satellite data.” Computers & Geosciences 35.8 (2009): 1652-1660.

Meenakshi, J. V., Ranjan Ray, and Souvik Gupta. “Estimates of poverty for SC, ST and female-headed households.” Economic and Political Weekly (2000): 2748-2754.

Mohanty, Abinash and Shreya WadhawanMapping India’s climate vulnerability. (2023, April 24). CEEW. https://www.ceew.in/publications/mapping-climate-change-vulnerability-index-of-india-a-district-level-assessment

Noor, Abdisalan M., et al. “Using remotely sensed night-time light as a proxy for poverty in Africa.” Population Health Metrics 6.1 (2008): 1-13.

Spada A, Fiore M, Galati A. The Impact of Education and Culture on Poverty Reduction: Evidence from Panel Data of European Countries. Soc Indic Res. 2023 Jun 14:1-14. doi: 10.1007/s11205-023-03155-0. Epub ahead of print. PMID: 37362180; PMCID: PMC10265551. Subash, Surendran Padmaja, Rajeev Ranjan Kumar, and Korekallu Srinivasa Aditya. “Satellite data and machine learning tools for predicting poverty in rural India.” Agricultural economics research review 31.2 (2018): 231-240.

Testbook. (2023, July 20). Poorest states in India 2023 – Get list of states by poverty rate Testbook. https://testbook.com/static-gk/poorest-state-in-india

UN India Poverty and urbanisation. (n.d.). India. https://india.un.org/en/171267-poverty-and-urbanisation

Wang, Wen, Hui Cheng, and Li Zhang. “Poverty assessment using DMSP/OLS night-time light satellite imagery at a provincial scale in China.” Advances in Space Research 49.8 (2012): 1253-1264.

Xu, Jianbin, et al. “Combining night time lights in prediction of poverty incidence at the county level.” Applied Geography 135 (2021): 102552.